Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures

ABSTRACT

A cache and TLB layout and design leverage repeater insertion to provide dynamic low-cost configurability trading off size and speed on a per-application-phase basis. A configuration management algorithm dynamically detects phase changes and reacts to an application's hit and miss intolerance in order to improve memory hierarchy performance while taking energy consumption into consideration.

REFERENCE TO RELATED APPLICATIONS

The present application is a division of U.S. patent application Ser. No. 09/708,727, filed Nov. 9, 2000, now U.S. Pat. No. 6,684,298.

STATEMENT OF GOVERNMENT INTEREST

This work was supported in part by Air Force Research Laboratory Grant F296091-00-K-0182 and National Science Foundation Grants CCR9701915; CCR9702466; CCR9811929; CDA9401142; and EIA9972881. The government has certain rights in the invention.

FIELD OF THE INVENTION

The present invention is directed to the optimization of memory caches and TLBs (translation look-aside buffers) and more particularly to dynamic optimization of both speed and power consumption for each application.

DESCRIPTION OF RELATED ART

The performance of general purpose microprocessors continues to increase at a rapid pace. In the last 15 years, performance has improved at a rate of roughly 1.6 times per year, with about half of that gain attributed to techniques for exploiting instruction-level parallelism and memory locality. Despite those advances, several impending bottlenecks threaten to slow the pace at which future performance improvements can be realized. Arguably the biggest potential bottlenecks for many applications in the future will be high memory latency and the lack of sufficient memory bandwidth. Although advances such as non-blocking caches and hardware- and software-based prefetching can reduce latency in some cases, the underlying structure of the memory hierarchy upon which those approaches are implemented may ultimately limit their effectiveness. In addition, power dissipation levels have increased to the point where future designs may be fundamentally limited by that constraint in terms of the functionality that can be included in future microprocessors. Although several well-known organizational techniques can be used to reduce the power dissipation in on-chip memory structures, the sheer number of transistors dedicated to the on-chip memory hierarchy in future processors (for example, roughly 92% of the transistors on the Alpha 21364 are dedicated to caches) requires that those structures be effectively used so as not to needlessly waste chip power. Thus, new approaches that improve performance in a more energy-efficient manner than conventional memory hierarchies are needed to prevent the memory system from fundamentally limiting future performance gains or exceeding power constraints.

The most commonly implemented memory system organization is likely the familiar multi-level memory hierarchy. The rationale behind that approach, which is used primarily in caches but also in some TLBs (e.g., in the MIPS R10000), is that a combination of a small, low-latency L1 memory backed by a higher-capacity, yet slower, L2 memory and finally by main memory provides the best tradeoff between optimizing hit time and miss time. Although that approach works well for many common desktop applications and benchmarks, programs whose working sets exceed the L1 capacity may expend considerable time and energy transferring data between the various levels of the hierarchy. If the miss tolerance of the application is lower than the effective L1 miss penalty, then performance may degrade significantly due to instructions waiting for operands to arrive. For such applications, a large, single-level cache (as used in the HP PA-8X00 series of microprocessors) may perform better and be more energy-efficient than a two-level hierarchy for the same total amount of memory. For similar reasons, the PA-8X00 series also implements a large, single-level TLB. Because the TLB and cache are accessed in parallel, a larger TLB can be implemented without impacting hit time in that case due to the large L1 caches that are implemented.

The fundamental issue in current approaches is that no one memory hierarchy organization is best suited for each application. Across a diverse application mix, there will inevitably be significant periods of execution during which performance degrades and energy is needlessly expended due to a mismatch between the memory system requirements of the application and the memory hierarchy implementation.

The inventors' previous approaches to that problem have exploited the partitioning of hardware resources to enable/disable parts of the cache under software control, but in a limited manner. The issues of how to practically implement such a design were not addressed in detail, the analysis only looked at changing configurations on an application-by-application basis (and not dynamically during the execution of a single application), and the simplifying assumption was made that the best configuration was known for each application. Furthermore, the organization and performance of the TLB were not addressed, and the reduction of the processor clock frequency with increases in cache size limited the performance improvement which could be realized.

Recently, Ranganathan, Adve, and Jouppi in “Reconfigurable caches and their application to media processing,” Proceedings of the 27th International Symposium on Computer Architecture, pages 214-224, June 2000, proposed a reconfigurable cache in which a portion of the cache could be used for another function, such as an instruction reuse buffer. Although the authors show that such an approach only modestly increases cache access time, fundamental changes to the cache may be required so that it may be used for other functionality as well, and long wire delays may be incurred in sourcing and sinking data from potentially several pipeline stages.

Furthermore, as more and more memory is integrated on-chip and increasing power dissipation threatens to limit future integration levels, the energy dissipation of the on-chip memory is as important as its performance. Thus, future memory-hierarchy designs must also be energy-aware by exploiting opportunities to trade off negligible performance degradation for significant reductions in power or energy. No satisfactory way of doing so is yet known in the art.

SUMMARY OF THE INVENTION

It will be readily apparent from the above that a need exists in the art to optimize the memory hierarchy organization for each application. It is therefore an object of the invention to reconfigure a cache dynamically for each application.

It is another object of the invention to improve both memory hierarchy performance and energy consumption.

To achieve the above and other objects, the present invention is directed to a cache in which a configuration management algorithm dynamically detects phase changes and reacts to an application's hit and miss intolerance in order to improve memory hierarchy performance while taking energy consumption into consideration.

The present invention provides a configurable cache and TLB orchestrated by a configuration algorithm that can be used to improve the performance and energy-efficiency of the memory hierarchy. A noteworthy feature of the present invention is the exploitation of the properties of conventional caches and future technology trends in order to provide cache and TLB configurability in a low-intrusive manner.

The present invention monitors cache and TLB usage and application latency tolerance at regular intervals by detecting phase changes using miss rates and branch frequencies, and thereby improves performance by properly balancing hit latency intolerance with miss latency intolerance dynamically during application execution (using CPI, or cycles per instruction, as the ultimate performance metric). Furthermore, instead of changing the clock rate, the present invention provides a cache and TLB with a variable latency so that changes in the organization of those structures only impact memory instruction latency and throughput. Finally, energy-aware modifications to the configuration algorithm are implemented that trade off a modest amount of performance for significant energy savings.

When applied to a two-level cache and TLB hierarchy at 0.1 μm technology, the result is an average 15% reduction in cycles per instruction (CPI), corresponding to an average 27% reduction in memory-CPI, across a broad class of applications compared to the best conventional two-level hierarchy of comparable size. Projecting to sub-0.1 μm technology design considerations which call for a three-level conventional cache hierarchy for performance reasons, a configurable L2/L3 cache hierarchy coupled with a conventional L1 results in an average 43% reduction in memory hierarchy energy in addition to improved performance.

The present invention significantly expands upon the inventors' previous results, which addressed only performance in a limited manner for one technology point (0.1 μm) using a different (more hardware-intensive) configuration algorithm. The present invention provides a configurable hierarchy as an L1/L2 replacement in 0.1 μm technology, and as an L2/L3 replacement for a 0.035 μm feature size. For the former, the present invention provides an average 27% improvement in memory performance, which results in an average 15% improvement in overall performance as compared to a conventional memory hierarchy. Furthermore, the energy-aware enhancements bring memory energy dissipation in line with a conventional organization, while still improving memory performance by 13% relative to the conventional approach. For 0.035 μm geometries, where the prohibitively high latencies of large on-chip caches call for a three-level conventional hierarchy for performance reasons, a configurable L2/L3 cache hierarchy coupled with a conventional L1 reduces overall memory energy by 43% while even slightly increasing performance. That latter result demonstrates that because the configurable approach significantly improves memory hierarchy efficiency, it can serve as a partial solution to the significant power dissipation challenges facing future processor architects.

BRIEF DESCRIPTION OF THE DRAWINGS

A preferred embodiment of the present invention will be set forth in detail with reference to the drawings, in which:

FIG. 1 shows an overall organization of the cache data arrays used in the preferred embodiment;

FIG. 2 shows the organization of one of the cache data arrays of FIG. 1;

FIG. 3 shows possible L1/L2 cache organizations which can be implemented in the cache data arrays of FIGS. 1 and 2;

FIG. 4 shows the organization of a configurable translation look-aside buffer according to the preferred embodiment;

FIG. 5 shows memory CPI for conventional, interval-based and subroutine-based configurable schemes;

FIG. 6 shows total CPI for conventional, interval-based and subroutine-based configurable schemes;

FIG. 7 shows memory EPI in nanojoules for conventional, interval-based and energy-aware configurable schemes;

FIG. 8 shows memory CPI for conventional, interval-based and energy-aware configurable schemes;

FIG. 9 shows memory CPI for conventional three-level and dynamic cache hierarchies;

FIG. 10 shows memory EPI in nanojoules for conventional three-level and dynamic cache hierarchies;

FIG. 11 shows a flow chart of operations performed in reconfiguring a cache; and

FIG. 12 shows a flow chart of operations performed in reconfiguring a translation look-aside buffer.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

A preferred embodiment of the present invention will now be set forth in detail with reference to the drawings.

The cache and TLB structures (both conventional and configurable) follow the structure described by G. McFarland, CMOS Technology Scaling and Its Impact on Cache Delay, Ph.D. thesis, Stanford University, June 1997. McFarland developed a detailed timing model for both the cache and TLB which balances both performance and energy considerations in subarray partitioning, and which includes the effects of technology scaling.

The preferred embodiment starts with a conventional 2 MB data cache 101 organized both for fast access time and for energy efficiency. As is shown in FIG. 1, the cache 101 is structured as two 1 MB interleaved banks 103, 105, each with a data bus 107 or 109. The banks 103, 105 are word-interleaved when used as an L1/L2 replacement and block-interleaved when used as an L2/L3 replacement. Such structuring is done in order to provide sufficient memory bandwidth for a four-way issue dynamic superscalar processor. In order to reduce access time and energy consumption, each 1 MB bank 103, 105 is further divided into two 512 KB SRAM structures or subarrays 111, 113, 115, 117, one of which is selected on each bank access. A number of modifications are made to that basic structure to provide configurability with little impact on access time, energy dissipation, and functional density.

The data array section of the configurable structure 101 is shown in FIG. 2, in which only the details of one subarray 113 are shown for simplicity. (The other subarrays 111, 115, 117 are identically organized.) There are four subarrays 111, 113, 115, 117, each of which contains four ways 201, 203, 205, 207 and has a precharge 208. A row decoder 209 having a pre-decoder 211 is connected to each subarray 111, 113, 115, 117 by a global wordline 213 and to the ways 201, 203, 205, 207 in each subarray 111, 113, 115, 117 by a local wordline 215. Each subarray 111, 113, 115, 117 communicates via column MUXers 217 and sense amps 219 with a data bus 221. A cache select logic 223 controls subarray/way select in accordance with a subarray select from the address, a tag hit from the tags, and a configuration control from a configuration register. In both the conventional and configurable cache, two address bits (Subarray Select) are used to select only one of the four subarrays 111, 113, 115, 117 on each access in order to reduce energy dissipation. The other three subarrays have their local wordlines 215 disabled, and their precharge 208, sense amp 219, and output driver circuits are not activated. The TLB virtual-to-real page number translation and tag check proceed in parallel, and only the output drivers for the way in which the hit occurred are turned on. Parallel TLB and tag access can be accomplished if the operating system can ensure that index_bits-page_offset_bits bits of the virtual and physical addresses are identical, as is the case for the four-way set associative 1 MB dual-banked L1 data cache in the HP PA-8500.
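
As a purely illustrative sketch of the Subarray Select addressing described above, the following C fragment shows how two select bits might be extracted from an address; the field positions (a 128-byte line with the select bits immediately above the line offset) are assumptions for the example, not the claimed layout.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_OFFSET_BITS 7   /* 128-byte cache lines            */
    #define SUBARRAY_BITS    2   /* four subarrays -> 2 select bits */

    /* Hypothetical field layout: the two Subarray Select bits sit
       immediately above the line offset; the index/tag split is not
       shown. */
    static unsigned subarray_select(uint64_t addr)
    {
        return (addr >> LINE_OFFSET_BITS) & ((1u << SUBARRAY_BITS) - 1);
    }

    int main(void)
    {
        uint64_t addr = 0x0002468a;
        /* Only the selected subarray is activated on this access;
           the other three keep their local wordlines disabled. */
        printf("address 0x%llx -> subarray %u\n",
               (unsigned long long)addr, subarray_select(addr));
        return 0;
    }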

In order to provide configurability while retaining fast access times, several modifications are made to McFarland's baseline design as shown in FIG. 2:

1. McFarland drives the global wordlines to the center of each subarray and then the local wordlines across half of the subarray in each direction in order to minimize the worst-case delay. In the configurable cache, because comparable delay with a conventional design for the smallest cache configurations is sought, the global wordlines 213 are distributed to the nearest end of each subarray 111, 113, 115, 117 and drive the local wordlines 215 across the entire subarray 111, 113, 115, 117.

2. McFarland organizes the data bits in each subarray by bit number. That is, data bit 0 from each way is grouped together, then data bit 1, etc. In the configurable cache, the bits are organized according to ways 201, 203, 205, 207 as shown in FIG. 2 in order to increase the number of configuration options.

3. Repeater switches 225 are used in the global wordlines 213 to electrically isolate each subarray. That is, subarrays 113 and 115 do not suffer additional global wordline delay due to the presence of subarrays 111 and 117. Providing switches as opposed to simple repeaters also prevents wordline switching in disabled subarrays, thereby saving dynamic power.

4. Repeater switches 227 are also used in the local wordlines to electrically isolate each way 201, 203, 205, 207 in a subarray. The result is that the presence of additional ways does not impact the delay of the fastest ways. Dynamic power dissipation is also reduced by disabling the wordline drivers of disabled ways.

5. Configuration Control signals received from the Configuration Register through the cache select logic 223 provide the ability to disable entire subarrays 111, 113, 115, 117 or ways 201, 203, 205, 207 within an enabled subarray. Local wordline and data output drivers and precharge and sense amp circuits 208, 219 are not activated for a disabled subarray or way.

Using McFarland's area model, the additional area from adding repeater switches to electrically isolate wordlines is estimated to be 7%. In addition, due to the large capacity (and resulting long wordlines) of each cache structure, each local wordline is roughly 2.75 mm in length at 0.1 μm technology, and a faster propagation delay is achieved with those buffered wordlines compared with unbuffered lines. Moreover, because local wordline drivers are required in a conventional cache, the extra drivers required to isolate ways within a subarray do not impact the spacing of the wordlines, and thus bitline length is unaffected. In terms of energy, the addition of repeater switches increases the total memory hierarchy energy dissipation by 2-3% in comparison with a cache with no repeaters for the simulated benchmarks.

With the above modifications, the cache behaves as a virtual two-level, physical one-level, non-inclusive cache hierarchy, with the sizes, associativities, and latencies of the two levels dynamically chosen. In other words, a single large cache organization serves as a configurable two-level non-inclusive cache hierarchy, where the ways within each subarray which are initially enabled for an L1 access are varied to match application characteristics. The latency of the two sections is changed on half-cycle increments according to the timing of each configuration (and assuming a 1 GHz processor). Half-cycle increments are required to provide the granularity to distinguish the different configurations in terms of their organization and speed. Such an approach can be implemented by capturing cache data using both phases of the clock, similar to the double-pumped Alpha 21264 data cache, and enabling the appropriate latch according to the configuration. The advantages of that approach are that the timing of the cache can change with its configuration while the main processor clock remains unaffected, and that no clock synchronization is necessary between the pipeline and cache.

However, because a constant two-stage cache pipeline is maintained regardless of the cache configuration, cache bandwidth degrades for the larger, slower configurations. Furthermore, the implementation of a cache whose latency can vary on half-cycle increments requires two pipeline modifications. First, the dynamic scheduling hardware must be able to speculatively issue (assuming a data cache hit) load-dependent instructions at different times depending on the currently enabled cache configuration. Second, for some configurations, running the cache on half-cycle increments requires an extra half-cycle for accesses to be caught by the processor clock phase. Some configurations may have a half-cycle difference between the two pipeline stages that are assumed for each cache configuration.

When used as a replacement for a conventional L1/L2 on-chip cache hierarchy, the possible configurations are shown in FIG. 3. That figure shows the possible L1/L2 cache organizations which can be configured, as shown by the various allocations of the ways to L1 and L2. Only one of the four 512 KB SRAM structures is shown. Abbreviations for each organization are listed to the left of the size and associativity of the L1 section, while L1 access times in cycles are given on the right. Note that the TLB access may dominate the overall delay of some configurations. The numbers listed simply indicate the relative order of the access times for all configurations and thus the size/access time tradeoffs allowable.

Although multiple subarrays may be enabled as L1 in an organization, as in a conventional cache, only one is selected on each access according to the Subarray Select field of the address. When a miss in the L1 section is detected, all tag subarrays and ways are read. That permits hit detection for data in the remaining portion of the cache (designated as L2 in FIG. 3). When such a hit occurs, the data in the L1 section (which has already been read out and placed into a buffer) is swapped with the data in the L2 section. In the case of a miss to both sections, the displaced block from the L1 section is placed into the L2 section. That prevents thrashing in the case of low-associative L1 organizations.
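
The following toy C model illustrates the swap-based policy just described for a single set; the two-way sizes, tag-only representation, and fixed victim choice are simplifications for illustration and do not reflect the actual replacement hardware.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of the swap-based non-inclusive policy: one set with
       an L1 section and an L2 (backup) section, tags only. Sizes and
       names are illustrative, not the hardware's. */
    enum { L1_WAYS = 2, L2_WAYS = 2, INVALID = -1 };
    static int64_t l1[L1_WAYS] = { INVALID, INVALID };
    static int64_t l2[L2_WAYS] = { INVALID, INVALID };

    static void access_block(int64_t tag)
    {
        for (int i = 0; i < L1_WAYS; i++)
            if (l1[i] == tag) { puts("L1 hit"); return; }
        for (int i = 0; i < L2_WAYS; i++)
            if (l2[i] == tag) {              /* L1 miss, L2 hit: swap */
                int64_t victim = l1[0];
                l1[0] = tag;
                l2[i] = victim;
                puts("L2 hit, swapped into L1");
                return;
            }
        /* Miss in both sections: the displaced L1 block moves to L2,
           preventing thrashing for low-associativity L1 sections. */
        l2[0] = l1[0];
        l1[0] = tag;
        puts("miss, filled L1, victim placed in L2");
    }

    int main(void)
    {
        access_block(7); access_block(9); access_block(7);
        return 0;
    }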

The direct-mapped 512 KB and two-way set associative 1 MB cache organizations are lower-energy, and lower-performance, alternatives to the 512 KB two-way and 1 MB four-way organizations, respectively. Those options activate half the number of ways on each access for the same capacity as their counterparts. For execution periods in which there are few cache conflicts and hit latency tolerance is high, the low-energy alternatives may result in comparable performance yet potentially save considerable energy. Those configurations are used in an energy-aware mode of operation as described below.

Because some of the configurations span only two subarrays, while others span four, the number of sets is not always the same. Hence, it is possible that a given address might map into a certain cache line at one time and into another at another time (called a mis-map). In cases where subarrays two and three are disabled, the high-order Subarray Select signal is used as a tag bit. That extra tag bit is stored on all accesses in order to detect mis-maps and to handle the case in which data is loaded into subarray 0 or 1 during a period when subarrays 2 or 3 are disabled, but then maps into one of those latter two subarrays upon their being re-enabled. That case is detected in the same manner as data in a disabled way. If the data is found in a disabled subarray, it is transferred to the correctly mapped subarray. Mis-mapped data is handled the same way as an L1 miss and L2 hit, i.e., it results in a swap. Simulation-based analysis indicates that such events occur infrequently for most applications.
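
The extra-tag-bit check might look like the following C sketch, where the bit positions are again illustrative assumptions: the high-order Subarray Select bit stored with the tag is compared against the bit implied by the current (four-subarray) mapping.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of mis-map detection with the extra tag bit described
       above. Field positions are illustrative. When only subarrays
       0 and 1 are enabled, the high-order Subarray Select bit is
       stored with the tag; on a later access with all four subarrays
       enabled, a mismatch flags a mis-mapped block. */
    #define LINE_BITS 7

    static unsigned high_select_bit(uint64_t addr)
    {
        return (addr >> (LINE_BITS + 1)) & 1u;   /* upper of 2 bits */
    }

    static bool mis_mapped(uint64_t addr, unsigned stored_bit)
    {
        return high_select_bit(addr) != stored_bit;
    }

    int main(void)
    {
        uint64_t addr = 0x300;     /* maps to subarray 2 or 3       */
        unsigned stored = 0;       /* filled while 2/3 were disabled */
        if (mis_mapped(addr, stored))
            puts("mis-map: treat as L1 miss / L2 hit and swap");
        return 0;
    }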

In sub-0.1 μm technologies, the long access latencies of a large on-chip L2 cache may be prohibitive for those applications which make use of only a small fraction of the L2 cache. Thus, for performance reasons, a three-level hierarchy with a moderate-size (e.g., 512 KB) L2 cache will become an attractive alternative to two-level hierarchies at those feature sizes. However, the cost may be a significant increase in energy dissipation due to transfers involving the additional cache level. It will be demonstrated below that the use of the aforementioned configurable cache structure as a replacement for conventional L2 and L3 caches can significantly reduce energy dissipation without any compromise in performance as feature sizes scale below 0.1 μm.

A 512-entry, fully-associative TLB 401 can be similarly configured, as shown in FIG. 4. There are eight TLB increments 403, each of which contains a CAM 405 of 64 virtual page numbers and an associated RAM 407 of 64 physical page numbers. Switches 409 are inserted on the input and output buses 411, 413 to electrically isolate successive increments. Thus, the ability to configure a larger TLB does not degrade the access time of the minimal size (64-entry) TLB. Similar to the cache design, TLB misses result in a second access, but to the backup portion of the TLB.

The configurable cache and TLB layout makes it possible to turn off selected repeaters, thus enabling only a fraction of the cache and TLB at any time. For the L1/L2 reconfiguration, that fraction represents an L1 cache, while the rest of the cache serves as a non-inclusive L2 which is looked up in the event of an L1 miss. Thus, L1 hit time is traded off with L1 miss time to improve performance. That structure can also take the place of an L2/L3 hierarchy. Trading off hit and miss time also reduces the number of cache-to-cache transfers, thus reducing the cache hierarchy energy dissipation.

Dynamic selection mechanisms will now be disclosed. First, the selection mechanisms for the configurable cache and TLB when used as a replacement for a conventional L1/L2 on-chip hierarchy will be disclosed. Then, the mechanisms as applied to a configurable L2/L3 cache hierarchy coupled with a conventional fixed-organization L1 cache will be disclosed.

The configurable cache and TLB approach makes it possible to pick appropriate configurations and sizes based on application requirements. The different configurations spend different amounts of time and energy accessing the L1 and the lower levels of the memory hierarchy. Heuristics improve the efficiency of the memory hierarchy by trying to minimize idle time due to memory hierarchy access. The goal is to determine the right balance between hit latency and miss rate for each application phase based on the tolerance of the phase for the hit and miss latencies. The selection mechanisms are designed to improve performance, and modifications are introduced to the heuristics which opportunistically trade off a small amount of performance for significant energy savings. Those heuristics require appropriate metrics for assessing the cache/TLB performance of a given configuration during each application phase.

Cache miss rates give a first-order approximation of the cache requirements of an application, but they do not directly reflect the effects of various cache sizes on memory stall cycles. Here, a metric is first presented which quantifies that effect, and the manner in which it can be used to dynamically pick an appropriate cache configuration is described. The actual number of memory stall cycles is a function of the time taken to satisfy each cache access and the ability of the out-of-order execution window to overlap other useful work while those accesses are made. In the prior art, load latency tolerance has been characterized, and two hardware mechanisms have been introduced for estimating the criticality of a load. One of those monitors the issue rate while a load is outstanding, and the other keeps track of the number of instructions dependent on that load. While those schemes are easy to implement, they are not very accurate in capturing the number of stall cycles resulting from an outstanding load. The preferred embodiment more accurately characterizes load stall time and further breaks that down as stalls due to cache hits and misses. The goal is to provide insight to the selection algorithm as to whether it is necessary to move to a larger or smaller L1 cache configuration (or not to move at all) for each application phase.

A simple mechanism will be described with reference to the flow chart of FIG. 11. The initial scheme or state, set in step 1101, is tuned to improve performance and thus explores the following five cache configurations: direct-mapped 256 KB L1, 768 KB 3-way L1, 1 MB 4-way L1, 1.5 MB 3-way L1, and 2 MB 4-way L1. The 512 KB 2-way L1 configuration provides no performance advantage over the 768 KB 3-way L1 configuration (due to their identical access times in cycles), and thus that configuration is not used. For similar reasons, the two low-energy configurations (direct-mapped 512 KB L1 and two-way set associative 1 MB L1) are only used with modifications to the heuristics which reduce energy (described shortly).

At the end of each interval of execution (step 1103; 100 K cycles in the simulations), a set of hardware counters is examined in step 1105. Those hardware counters provide the miss rate, the IPC, and the branch frequency experienced by the application in that last interval. Based on that information, the selection mechanism (which could be implemented in software or hardware) picks one of two states in step 1107: stable or unstable. The former suggests that behavior in that interval is not very different from the last and that it is not necessary to change the cache configuration, while the latter suggests that there has recently been a phase change in the program and that an appropriate size needs to be picked.

The initial state set in step 1101 is unstable, and the initial L1 cache is chosen to be the smallest (256 KB in the preferred embodiment). At the end of an interval, the CPI experienced for that cache size is entered into a table in step 1109. If the miss rate exceeds a certain threshold (1% in the preferred embodiment) during that interval, as determined in step 1111, and the maximum L1 size is not reached, as determined in step 1113, the next larger L1 cache configuration is adopted for the next interval of operation in step 1115 in an attempt to contain the working set. That exploration continues until the maximum L1 size is reached or until the miss rate is sufficiently small. At that point, in step 1117, the table is examined, the cache configuration with the lowest CPI is picked, the table is cleared, and the stable state is switched to. The cache remains in the stable state while the number of misses and branches does not significantly differ from that in the previous interval, as determined in step 1119. When there is a change, then in step 1121, the unstable state is switched to, the smallest L1 cache configuration is returned to, and the exploration starts again. The above is repeated in step 1123 for the next interval. The pseudo-code for the mechanism is listed below.

    if (state == STABLE)
        if ((num_miss - last_num_miss) < m_noise &&
            (num_br - last_num_br) < br_noise)
            decr m_noise, br_noise;
        else
            cache_size = SMALLEST;
            state = UNSTABLE;
    if (state == UNSTABLE)
        record CPI;
        if ((miss_rate > THRESHOLD) && (cache_size != MAX))
            cache_size++;
        else
            cache_size = that with best CPI;
            state = STABLE;
            if (cache_size == prev_cache_size)
                incr br_noise, m_noise;

Different applications see different variations in the number of misses and branches as they move across application phases. Hence, instead of using a single fixed number as the threshold to detect phase changes, the threshold is changed dynamically. If an exploration phase results in picking the same cache size as before, the noise threshold is increased to discourage such needless explorations. Likewise, every interval spent in the stable state causes a slight decrement in the noise threshold in case it had been set to too high a value.

The miss rate threshold ensures that larger cache sizes are explored only if required. Note that a high miss rate need not necessarily have a large impact on performance because of the ability of dynamic superscalar processors to hide L2 latencies. That could result in a few needless explorations.

The intolerance metrics only serve as a guide to help limit the search space. Exploration is expensive and should preferably not be pursued unless there is a possible benefit. Clearly, such an interval-based mechanism is best suited to programs which can sustain uniform behavior for a number of intervals. While switching to an unstable state, step 1121 also moves to the smallest L1 cache configuration as a form of “damage control” for programs having irregular behavior. That choice ensures that for those programs, more time is spent at the smaller cache sizes and hence performance is similar to that using a conventional cache hierarchy. In addition, the mechanism keeps track of how many intervals are spent in stable and unstable states. If it turns out that too much time is spent exploring, the program behavior is not suited to an interval-based scheme, and the smallest sized cache is retained.

Earlier experiments used a novel hardware design to estimate the hit and miss latency intolerance of an application's phase (which the selection mechanism just set forth attempts to minimize). Those estimates were then used to detect phase changes as well as to guide exploration. As current results show in comparison to those of the inventors' previous experiments, the additional complexity of the hardware is not essential to obtaining good performance. Presently, it is envisioned that the selection mechanism would be implemented in software, although, as noted above, it could be implemented in hardware instead. Every 100 K cycles, a low-overhead software handler will be invoked which examines the hardware counters and updates the state as necessary. That imposes minimal hardware overhead, as the state can be stored in memory, and it allows flexibility in terms of modifying the selection mechanism. The code size of the handler is estimated to be only 120 static assembly instructions, only a fraction of which are executed during each invocation, resulting in a net overhead of less than 0.1%. In terms of hardware overhead, roughly nine 20-bit counters are needed for the number of misses, loads, cycles, instructions, and branches, in addition to a state register. That amounts to fewer than 8,000 transistors, and most processors already come equipped with some such performance counters.
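
A minimal C sketch of such a handler appears below, assuming the counter values are made available to software; the structure, field names, and fixed-point scaling are illustrative choices, not the 120-instruction handler itself.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the 100 K-cycle software handler described above.
       Counter names and values are illustrative; real hardware would
       supply them from roughly 20-bit performance counters. */
    typedef struct {
        uint32_t misses, loads, cycles, instrs, branches;
    } interval_counters_t;

    static void interval_handler(const interval_counters_t *c)
    {
        /* Fixed-point rates (x1000) keep the handler free of FP code. */
        uint32_t miss_rate = c->loads  ? 1000u * c->misses / c->loads  : 0;
        uint32_t cpi       = c->instrs ? 1000u * c->cycles / c->instrs : 0;
        printf("miss rate %u.%u%%, CPI %u.%03u, branches %u\n",
               miss_rate / 10, miss_rate % 10,
               cpi / 1000, cpi % 1000, c->branches);
        /* ...these values feed the stable/unstable state machine... */
    }

    int main(void)
    {
        interval_counters_t c = { 900, 25000, 100000, 80000, 12000 };
        interval_handler(&c);   /* prints: miss rate 3.6%, CPI 1.250 */
        return 0;
    }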

One such early experiment will now be described. To every entry in the register map table, one bit is added which indicates whether the given (logical) register is to be written by a load instruction. In addition, for every entry in the Register Update Unit (RUU), which is a unified queue and re-order buffer structure which holds all instructions which have dispatched and not committed, one bit is added per operand which specifies whether the operand is produced by a load (which can be deduced from the additional register map table bits) and another specifying whether the load was a hit (the initial value upon insertion into the RUU) or a miss. Every cycle, that information is used to determine how many instructions were stalled by an outstanding load. Each cycle, every instruction in the RUU which directly depends on a load increments one of two global intolerance counters if (i) all operands except for the operand produced by a load are ready, (ii) a functional unit is available, and (iii) there are free issue slots in that cycle. For every cycle in which those conditions are met up to the point that the load-dependent instruction issues, the hit intolerance counter is incremented unless a cache miss is detected for the load on which it is dependent; if such a miss occurs, the hit/miss bit is switched and the miss intolerance counter is incremented each cycle that the above three conditions are met until the point at which the instruction issues. If more than one operand of an instruction is produced by a load, a heuristic is used to choose the hit/miss bit of one of the operands. Simulations have been performed which choose the operand corresponding to the load which issued first. That scheme requires only very minor changes to existing processor structures and two additional performance counters, and yet it provides a very accurate assessment of the relative impact of the hit time and the miss time of the current cache configuration on actual execution time of a given program phase.
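
The per-cycle counter update can be sketched in C as follows; the RUU entry is reduced to the three relevant bits and the issue-resource checks are collapsed into two boolean inputs, so this is a behavioral illustration rather than the hardware.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy per-cycle update of the hit/miss intolerance counters
       described above. The fields mirror the added RUU bits;
       everything else is a simplified stand-in for issue logic. */
    typedef struct {
        bool depends_on_load;   /* operand produced by a load      */
        bool load_was_miss;     /* hit/miss bit, flipped on a miss */
        bool others_ready;      /* all non-load operands ready     */
    } ruu_entry_t;

    static uint64_t hit_intolerance, miss_intolerance;

    static void per_cycle_update(const ruu_entry_t *e,
                                 bool fu_available, bool issue_slot_free)
    {
        if (e->depends_on_load && e->others_ready &&
            fu_available && issue_slot_free) {
            if (e->load_was_miss)
                miss_intolerance++;  /* stalled behind a missing load */
            else
                hit_intolerance++;   /* stalled behind a hitting load */
        }
    }

    int main(void)
    {
        ruu_entry_t e = { true, false, true };
        per_cycle_update(&e, true, true);   /* one stall cycle: hit     */
        e.load_was_miss = true;             /* miss detected: bit flips */
        per_cycle_update(&e, true, true);   /* one stall cycle: miss    */
        printf("hit intolerance %llu, miss intolerance %llu\n",
               (unsigned long long)hit_intolerance,
               (unsigned long long)miss_intolerance);
        return 0;
    }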

The metric just described has limitations in the presence of multiple stalled instructions due to loads. Free issue slots may be mis-categorized as hit or miss intolerance if the resulting dependence chains were to converge. That mis-categorization of lack of ILP manifests itself when the converging dependence chains are of different lengths. Multiple dependence chains go on to converge, and each chain could have a different length. The program is usually limited by the longer chain, i.e., stalling the shorter chain for a period of time should not affect the execution time. Hence, the number of program stall cycles should be dependent on the stall cycles for the longer dependence chain. The chain on the critical path is difficult to compute at runtime. The miss and hit intolerance metrics effectively add the stalls for both chains and in practice work well. For TLB characterization, the preferred embodiment implements a simple TLB miss handler cycle counter due to the fact that in the model used, the pipeline stalls while a TLB miss is serviced (assuming that TLB miss handling is done in software). TLB usage is also tracked by counting the number of TLB entries accessed during a specified period.

Large L1 caches have a high hit rate, but also have higher access times. To arrive at the cache configuration which is the optimal trade-off point between the cache hit and miss times, the preferred embodiment uses a simple mechanism which uses past history to pick a size for the future, based on CPI as the performance metric.

The cache hit and miss intolerance counters indicate the effect of a given cache organization on actual execution time. Large caches tend to have higher hit intolerance because of the greater access time, but lower miss intolerance due to the smaller miss rate. Those intolerance counters serve as a hint to indicate which cache configurations to explore, and as a rule of thumb, the best configuration is often the one with the smallest sum of hit and miss intolerance. To arrive at that configuration dynamically at runtime, a simple mechanism is used which uses past history to pick a size for the future.

In addition to cache reconfiguration, the TLB configuration is also progressively changed, as shown in the flow chart of FIG. 12. The change is performed on an interval-by-interval basis, as indicated by steps 1201 and 1215. A counter tracks TLB miss handler cycles in step 1203. In step 1205, a single bit is added to each TLB entry which is set to indicate whether it has been used in an interval (and is cleared at the start of an interval). If the counter exceeds a threshold (which is contemplated to be 3%, although those skilled in the art will be able to select the threshold needed) of the total execution time counter for an interval, as determined in step 1207, the L1 TLB size is increased in step 1209. In step 1211, it is determined whether the TLB usage is less than half. If so, the L1 TLB size is decreased in step 1213.
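
A compact C rendering of the FIG. 12 decision follows, under the assumption that TLB sizes double and halve between the 64- and 512-entry limits (the text describes moving between sizes but does not mandate power-of-two steps).

    #include <stdio.h>

    /* Sketch of the per-interval TLB sizing rule of FIG. 12. The 3%
       miss-handler threshold follows the text; the size steps and
       names are illustrative. */
    enum { TLB_MIN = 64, TLB_MAX = 512 };

    static int resize_tlb(int size, long mhandler_cycles,
                          long interval_cycles, int entries_used)
    {
        if (100 * mhandler_cycles > 3 * interval_cycles && size < TLB_MAX)
            return size * 2;          /* too many miss cycles: grow */
        if (2 * entries_used < size && size > TLB_MIN)
            return size / 2;          /* usage below half: shrink   */
        return size;
    }

    int main(void)
    {
        int size = 64;
        size = resize_tlb(size, 40000, 1000000, 60); /* 4% > 3%: grow */
        printf("new TLB size: %d entries\n", size);  /* prints 128    */
        return 0;
    }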

For the cache reconfiguration, an interval size of 100 K cycles was chosen so as to react quickly to changes without letting the selection mechanism pose a high cycle overhead. For the TLB reconfiguration, a larger one-million-cycle interval was used so that an accurate estimate of TLB usage could be obtained. A smaller interval size could result in a spuriously high TLB miss rate over some intervals, and/or low TLB usage. For both the cache and the TLB, the interval sizes are illustrative rather than limiting, and other interval sizes can be used instead.

A miss in the first-level cache causes a lookup in the backup ways (the second level of the exclusive cache). Applications whose working set does not fit in the 2 MB of on-chip cache will often not find data in the L2 section. Such applications might be better off bypassing the L2 section lookup altogether. Previous work has investigated bypassing in the context of cache data placement, i.e., selectively choosing not to place data in certain levels of cache. In contrast, the preferred embodiment bypasses the lookup to a particular cache level. Once the dynamic selection mechanism has reached the stable state, the L2 hit rate counter is checked. If that is below a particular threshold, the L2 lookup is bypassed for the next interval. If that results in a CPI improvement, bypassing continues. Bypassing a level of cache would mean paying the cost of flushing all dirty lines first. That penalty can be alleviated in a number of ways: (i) do the writebacks in the background when the bus is free, and until that happens, access the backup and memory simultaneously; (ii) attempt bypassing only after context switches, so that fewer writebacks need to be done.
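
The bypass decision might be coded as in the following C sketch; the 10% hit-rate threshold and the exact revert rule are assumptions, since the text specifies only "a particular threshold" and a CPI-improvement test.

    #include <stdbool.h>
    #include <stdio.h>

    /* Sketch of the L2-section bypass heuristic described above. */
    static bool bypass_l2;
    static long baseline_cpi_x1000 = -1;

    static void end_of_interval(bool stable, long l2_hit_rate_x1000,
                                long cpi_x1000)
    {
        if (!stable) { bypass_l2 = false; return; }
        if (!bypass_l2 && l2_hit_rate_x1000 < 100) {  /* < 10% hits   */
            bypass_l2 = true;              /* try skipping L2 lookup  */
            baseline_cpi_x1000 = cpi_x1000;  /* remember baseline CPI */
        } else if (bypass_l2 && cpi_x1000 > baseline_cpi_x1000) {
            bypass_l2 = false;             /* no CPI gain: revert     */
        }
    }

    int main(void)
    {
        end_of_interval(true, 50, 1800);   /* poor L2: start bypass  */
        end_of_interval(true, 50, 1700);   /* CPI improved: keep it  */
        printf("bypassing L2 lookup: %s\n", bypass_l2 ? "yes" : "no");
        return 0;
    }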

As previously mentioned, the interval-based scheme will work well only if the program can sustain its execution phase for a number of intervals. That limitation may be overcome by collecting statistics and making subsequent configuration changes on a per-subroutine basis. The finite state machine used for the interval-based scheme is now employed for each subroutine. That requires maintaining a table with CPI values at different cache sizes and the next size to be picked for a limited number of subroutines (100 in the present embodiment). To focus on the most important routines, only those subroutines are monitored whose invocations exceed a certain threshold of instructions (1000 in the present embodiment). When a subroutine is invoked, its table entry is looked up, and a change in cache configuration is effected depending on the table entry for that subroutine. When a subroutine exits, it updates the table based on the statistics collected during that invocation. A stack is used to checkpoint counters on every subroutine call so that statistics can be determined for each subroutine invocation.

Two subroutine-based schemes were investigated. In the non-nested approach, statistics are collected for a subroutine and its callees. Cache size decisions for a subroutine are based on those statistics collected for the call-graph rooted at that subroutine. Once the cache configuration is changed for a subroutine, none of its callees can change the configuration unless the outer subroutine returns. Thus, the callees inherit the size of their callers because their statistics played a role in determining the configuration of the caller. In the nested scheme, each subroutine collects statistics only for the period when it is the top of the subroutine call stack. Thus, every single subroutine invocation is looked upon as a possible change in phase. Those schemes work well only if successive invocations of a particular subroutine are consistent in their behavior. A common case where that is not true is that of a recursive program. That situation is handled by not letting a subroutine update the table if there is an outer invocation of the same subroutine, i.e., it is assumed that only the outermost invocation is representative of the subroutine and that successive outermost invocations will be consistent in their behavior.

If the stack used to checkpoint statistics overflows, it is assumed that future invocations will inherit the size of their caller for the non-nested case, and the minimum sized cache will be used for the nested case. While the stack is in a state of overflow, subroutines will be unable to update the table. If a table entry is not found while entering a subroutine, the default smallest sized cache is used for that subroutine for the nested case.
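
In software, the per-subroutine bookkeeping for this scheme might look like the following C sketch; the 100-entry table follows the text, while the entry fields, stack depth, and size encoding are illustrative.

    #include <stdio.h>
    #include <string.h>

    /* Toy bookkeeping for the subroutine-based scheme described
       above: a per-subroutine table of observed CPI and the next
       cache size to use, and a checkpoint stack for cycle counters. */
    enum { MAX_SUBS = 100, MAX_DEPTH = 32 };

    typedef struct { long cpi_x1000; int cache_size; } sub_entry_t;
    static sub_entry_t table[MAX_SUBS];
    static long stack[MAX_DEPTH];   /* checkpointed cycle counters */
    static int  depth;

    static int on_call(int sub_id, long cycle_counter)
    {
        if (depth < MAX_DEPTH)
            stack[depth++] = cycle_counter;   /* checkpoint           */
        return table[sub_id].cache_size;      /* configure the cache  */
    }

    static void on_return(int sub_id, long cycle_counter,
                          long instrs, int next_size)
    {
        if (depth == 0) return;               /* overflow state: skip */
        long elapsed = cycle_counter - stack[--depth];
        table[sub_id].cpi_x1000  = instrs ? 1000 * elapsed / instrs : 0;
        table[sub_id].cache_size = next_size;
    }

    int main(void)
    {
        memset(table, 0, sizeof table);
        table[7].cache_size = 1;              /* e.g. 768 KB 3-way   */
        int size = on_call(7, 100000);
        on_return(7, 250000, 120000, size + 1);
        printf("subroutine 7: CPI %ld.%03ld, next size index %d\n",
               table[7].cpi_x1000 / 1000, table[7].cpi_x1000 % 1000,
               table[7].cache_size);
        return 0;
    }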

Because the simpler non-nested approach generally outperformed the nested scheme, results will be reported below only for the former.

Energy-aware modifications will now be disclosed. There are two energy-aware modifications to the selection mechanisms. The first takes advantage of the inherently low-energy configurations (those with direct-mapped 512 KB and two-way set associative 1 MB L1 caches). With that approach, the selection mechanism simply uses those configurations in place of the 768 KB 3-way L1 and 1 MB 4-way L1 configurations.

A second potential approach is to serially access the tag and data arrays of the L1 data cache. Conventional L1 caches always perform parallel tag and data lookup to reduce hit time, thereby reading data out of multiple cache ways and ultimately discarding data from all but one way. By performing tag and data lookup in series, only the data way associated with the matching tag need be accessed, thereby reducing energy consumption. Hence, the second low-energy mode operates just like the interval-based scheme as before, but accesses the set-associative cache configurations by serially reading the tag and data arrays.
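
The energy argument can be made concrete with a back-of-the-envelope C calculation; the per-array energies below are invented solely for illustration.

    #include <stdio.h>

    /* Back-of-the-envelope illustration of why serial tag/data access
       saves energy in a set-associative cache: parallel lookup reads
       every way's data array, serial lookup reads only the matching
       way. The per-array energy numbers are made up. */
    int main(void)
    {
        const int    ways      = 4;
        const double tag_read  = 0.05;  /* nJ per tag array access  */
        const double data_read = 0.30;  /* nJ per data array access */

        double parallel = ways * (tag_read + data_read);
        double serial   = ways * tag_read + 1 * data_read;

        printf("parallel lookup: %.2f nJ\n", parallel);  /* 1.40 nJ */
        printf("serial lookup:   %.2f nJ\n", serial);    /* 0.50 nJ */
        return 0;
    }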

L1 caches are inherently more energy-hungry than L2 caches because they perform parallel tag and data access, as a result of which they look up more cache ways than actually required. Increasing the size of the L1 as described thus far would result in an increase in energy consumption in the caches. The natural question is whether it makes sense to attempt reconfiguration of the L2 so that a CPI improvement can be obtained without the accompanying energy penalty.

Hence, the present cache design can be used as an exclusive L2/L3, in which case the size of the L2 is dynamically changed. The selection mechanism for the L2/L3 reconfiguration is very similar to the simple interval-based mechanism for the L1/L2 described above. In addition, because it is assumed that the L2 and L3 caches (both conventional and configurable) already use serial tag/data access to reduce energy dissipation, the energy-aware modifications would provide no additional benefit for L2/L3 reconfiguration. (Recall that performing the tag lookup first makes it possible to turn on only the required data way within a subarray, as a result of which all configurations consume the same amount of energy for the data array access.) Finally, the TLB reconfiguration was not simultaneously examined so as not to vary the access time of the fixed L1 data cache. Much of the motivation for those simplifications was the expectation that dynamic L2/L3 cache configuration would yield mostly energy-saving benefits, because the L1 cache configuration (the organization of which has the largest memory performance impact for most applications) was not being altered. To further improve energy savings at minimal performance penalty, the search mechanism was also modified to pick a larger sized cache if it performed almost as well (within 95% in the simulations) as the best performing cache during the exploration, thus reducing the number of transfers between the L2 and L3.
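
The modified selection step might be expressed as in the following C sketch, which scans the explored sizes from largest to smallest and takes the first whose performance is at least 95% of the best observed; IPC is used here so that "within 95%" reads as a simple inequality, and the values are illustrative.

    #include <stdio.h>

    /* Sketch of the energy-biased size selection described above:
       among the explored L2 sizes, pick the largest one performing
       within 95% of the best, reducing L2/L3 transfers. */
    static int pick_size(const long ipc_x1000[], int n)
    {
        long best = 0;
        for (int i = 0; i < n; i++)
            if (ipc_x1000[i] > best) best = ipc_x1000[i];
        for (int i = n - 1; i >= 0; i--)      /* largest size first */
            if (100 * ipc_x1000[i] >= 95 * best)
                return i;
        return 0;
    }

    int main(void)
    {
        /* IPC observed at each explored L2 size, smallest to largest. */
        long ipc[] = { 900, 955, 960, 880 };
        printf("chosen size index: %d\n", pick_size(ipc, 4)); /* 2 */
        return 0;
    }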

In summary, the dynamic mechanisms just set forth estimate the needs of the application and accordingly pick an appropriate cache and TLB configuration. Hit and miss intolerance metrics were introduced which quantify the effect of various cache sizes on the program's execution time. Those metrics provide guidance in the exploration of various cache sizes, making sure that a larger size is not tried unless miss intolerance is sufficiently high. The interval-based method collects those statistics every 100 K cycles and, based on recent history, picks a size for the future. The subroutine-based method does that for every subroutine invocation. To reduce energy dissipation, the selection mechanism is kept as it is, but the cache configurations available to it are changed, i.e., the energy-efficient low-associativity caches or caches that do serial tag and data lookup are used. The same selection mechanism is also applied to the L2/L3 reconfiguration. The above techniques will now be evaluated.

SimpleScalar-3.0 was used for the Alpha AXP instruction set to simulate an aggressive 4-way superscalar out-of-order processor. The architectural parameters used in the simulation are summarized in Table 1:

  Fetch queue entries           8
  Branch predictor              combination of bimodal and two-level gshare;
                                bimodal/gshare level 1/2 entries: 2048, 1024
                                (hist. 10), 4096 (global), respectively;
                                combining pred. entries: 1024;
                                RAS entries: 32; BTB: 2048 sets, 2-way
  Branch misprediction latency  8 cycles
  Fetch, decode, issue width    4
  RUU and LSQ entries           64 and 32
  L1 I-cache                    2-way; 64 KB (0.1 μm), 32 KB (0.035 μm)
  Memory latency                80 cycles (0.1 μm), 114 cycles (0.035 μm)
  Integer ALUs/mult-div         4/2
  FP ALUs/mult-div              2/1

The data memory hierarchy is modeled in great detail. For the reconfigurable cache, the 2 MB of on-chip cache is partitioned as a two-level exclusive cache, where the size of the L1 is dynamically picked. It is organized as two word-interleaved banks, each of which can service up to one cache request every cycle. It is assumed that the access is pipelined, so a fresh request can issue after half the time it takes to complete one access. Contention for all caches and buses in the memory hierarchy, as well as for writeback buffers, is modeled. The line size of 128 bytes was chosen because it yielded a much lower miss rate for the benchmark set than smaller line sizes.

As shown in FIG. 3, the minimum cache is 256 KB and direct-mapped, while the largest is 2 MB 4-way, the access times being 2 and 4.5 cycles, respectively. The minimum sized TLB has 64 entries, while the largest has 512. For both configurable and conventional TLB hierarchies, a TLB miss at the first level results in a lookup in the second level. A miss in the second level results in a call to a TLB handler that is assumed to complete in 30 cycles. The page size is 8 KB.

The configurable TLB is not like an inclusive two-level TLB in that the second level is never written to. It is looked up in the hope of finding an entry left over from a previous configuration with a larger level-one TLB. Hence it is much simpler than the conventional two-level TLB of the same size.

A variety of benchmarks from SPEC95, SPEC2000, and the Olden suite have been used. Those particular programs were chosen because they have high miss rates for the L1 caches considered. For programs with low miss rates for the smallest cache size, the dynamic scheme affords no advantage and behaves like a conventional cache. The benchmarks were compiled with the Compaq cc, f77, and f90 compilers at an optimization level of O3. Warmup times were determined for each benchmark, and the simulation was fast-forwarded through those phases. The window size was chosen to be large enough to accommodate at least one outermost iteration of the program, where applicable. A further million instructions were simulated in detail to prime all structures before starting the performance measurements. Table 2 below summarizes the benchmarks and their memory reference properties (the L1 miss rate and load frequency).

  Benchmark  Suite        Datasets        Simulation window  64K-2way L1  % of instrs
                                          (instrs)           miss rate    that are loads
  em3d       Olden        20,000 nodes,   1000M-1100M        20%          36%
                          arity 20
  health     Olden        4 levels,       80M-140M           16%          54%
                          1000 iters
  mst        Olden        256 nodes       entire program     8%           18%
                                          (14M)
  compress   SPEC95 INT   ref             1900M-2100M        13%          22%
  hydro2d    SPEC95 FP    ref             2000M-2135M        4%           28%
  apsi       SPEC95 FP    ref             2200M-2400M        6%           23%
  swim       SPEC2000 FP  ref             2500M-2782M        10%          25%
  art        SPEC2000 FP  ref             300M-1300M         16%          32%

With regard to timing and energy estimation, the inventors investigated two future technology feature sizes: 0.1 and 0.035 μm. For the 0.035 μm design point, cache latency values were used whose model parameters are based on projections from the Semiconductor Industry Association Technology Roadmap. For the 0.1 μm design point, the cache and TLB timing model developed by McFarland is used to estimate timings for both the configurable cache and TLB and the caches and TLBs of a conventional L1/L2 hierarchy. McFarland's model contains several optimizations, including the automatic sizing of gates according to loading characteristics, and the careful consideration of the effects of technology scaling down to 0.1 μm technology. The model integrates a fully-associative TLB with the cache to account for cases in which the TLB dominates the L1 cache access path. That occurs, for example, for all of the conventional caches that were modeled as well as for the minimum-size L1 cache (direct-mapped 256 KB) in the configurable organization.

For the global wordline, local wordline, and output driver select wires, cache and TLB wire delays are recalculated using RC delay equations for repeater insertion. Repeaters are used in the configurable cache as well as in the conventional L1 cache whenever they reduce wire propagation delay. The energy dissipation of those repeaters was accounted for as well, and they add only 2-3% to the total cache energy.

Cache and TLB energy dissipation were estimated using a modified version of the analytical model of Kamble and Ghose. That model calculates cache energy dissipation using similar technology and layout parameters as those used by the timing model (including voltages and all electrical parameters appropriately scaled for 0.1 μm technology). The TLB energy model was derived from that model and includes CAM match line precharging and discharging, CAM wordline and bitline energy dissipation, as well as the energy of the RAM portion of the TLB. For main memory, only the energy dissipated due to driving the off-chip capacitive buses was included.

For all L2 and L3 caches (both configurable and conventional), the inventors assume serial tag and data access and selection of only one of 16 data banks at each access, similar to the energy-saving approach used in the Alpha 21164 on-chip L2 cache. In addition, the conventional L1 caches were divided into two subarrays, only one of which is selected at each access. That is identical to the smallest 64 KB section accessed in one of the four configurable cache structures, with the exception that the configurable cache reads its full tags at each access (to detect data in disabled subarrays/ways). Thus, the conventional cache hierarchy against which the reconfigurable hierarchy was compared was highly optimized for both fast access time and low energy dissipation.

Detailed event counts were captured during SimpleScalar simulations of each benchmark. Those event counts include all of the operations that occur for the configurable cache as well as all TLB events, and are used to obtain final energy estimations.

Table 3 below shows the conventional and dynamic L1/L2 schemes simulated:

  A  Base exclusive cache with 256 KB 1-way L1 & 1.75 MB 14-way L2
  B  Base inclusive cache with 256 KB 1-way L1 & 2 MB 16-way L2
  C  Base inclusive cache with 64 KB 2-way L1 & 2 MB 16-way L2
  D  Interval-based dynamic scheme
  E  Subroutine-based with non-nested changes
  F  Interval-based with energy-aware cache configurations
  G  Interval-based with serial tag and data access

The dynamic schemes of the preferred embodiment will be compared with three conventional configurations which are identical in all respects except the data cache hierarchy. The first uses a two-level non-inclusive cache, with a direct-mapped 256 KB L1 cache backed by a 14-way 1.75 MB L2 cache (configuration A). The L2 associativity results from the fact that 14 ways remain in each 512 KB structure after two of the ways are allocated to the 256 KB L1 (only one of which is selected on each access). Comparison of that scheme with the configurable approach demonstrates the advantage of resizing the first level. The inventors also compare the preferred embodiment with a two-level inclusive cache which consists of a 256 KB direct-mapped L1 backed by a 16-way 2 MB L2 (configuration B). That configuration serves to measure the impact of the non-inclusive policy of the first base case on performance (a non-inclusive cache performs worse because every miss results in a swap or writeback, which causes greater bus and memory port contention). Another comparison is with a 64 KB 2-way inclusive L1 and 2 MB of 16-way L2 (configuration C), which represents a typical configuration in a modern processor and ensures that the performance gains for the dynamically sized cache are not obtained simply by moving from a direct-mapped to a set associative cache. For both the conventional and configurable L2 caches, the access time is 15 cycles due to serial tag and data access and bus transfer time, but is pipelined with a new request beginning every four cycles. The conventional TLB is a two-level inclusive TLB with 64 entries in the first level and 448 entries in the second level, with a 6-cycle lookup time.

For L2/L3 reconfiguration, the interval-based configurable cache is compared with a conventional three-level on-chip hierarchy. In both, the L1 cache is 32 KB two-way set associative with a three-cycle latency, reflecting the smaller L1 caches and increased latency likely required at 0.035 μm geometries. For the conventional hierarchy, the L2 cache is 512 KB two-way set associative with a 21-cycle latency, and the L3 cache is 2 MB 16-way set associative with a 60-cycle latency. Serial tag and data access is used for both L2 and L3 caches to reduce energy dissipation.

The inventors will first evaluate the performance and energy dissipation of the L1/L2 configurable schemes versus the three conventional approaches using delay and energy values for 0.1 μm geometries. It will then be demonstrated how L2/L3 reconfiguration can be used at finer 0.035 μm geometries to dramatically improve energy efficiency relative to a conventional three-level hierarchy but with no compromise of performance.

FIGS. 5 and 6 show the memory CPI and total CPI, respectively, achieved by the conventional and configurable interval- and subroutine-based schemes for the various benchmarks. The memory CPI is calculated by subtracting the CPI achieved with a simulated system with a perfect cache (all hits and one-cycle latency) from the CPI with the memory hierarchy. In comparing the arithmetic mean (AM) of the memory CPI performance, the interval-based configurable scheme outperforms the best-performing conventional scheme (B) (measured in terms of a percentage reduction in memory CPI) by 27%, with roughly equal cache and TLB contributions, as is shown in Table 4 below:

           Cache          TLB            Cache          TLB
           contribution   contribution   explorations   changes
em3d            73%            27%             10            2
health          33%            67%             27            2
mst            100%             0%              5            3
compress        64%            36%             54            2
hydro2d        100%             0%             19            0
apsi           100%             0%             63           27
swim            49%            51%              5            6
art            100%             0%             11            5

For each application, that table also presents the number of cache and TLB explorations that resulted in the selection of different sizes. In terms of overall performance, the interval-based scheme achieves a 15% reduction in CPI. The benchmarks with the biggest memory CPI reductions are health (52%), compress (50%), apsi (31%), and mst (30%).

The dramatic improvements with health and compress are due to the fact that particular phases of those applications perform best with a large L1 cache, even with the resulting higher hit latencies (for which there is reasonably high tolerance within those applications). For health, the configurable scheme settles at the 1.5 MB cache size for most of the simulated execution period, while the 768 KB configuration is chosen for much of compress's execution period. Note that TLB reconfiguration also plays a major role in the performance improvements achieved. Those two programs best illustrate the mismatch that often occurs between the memory hierarchy requirements of particular application phases and the organization of a conventional memory hierarchy, and how an intelligently managed configurable hierarchy can better match on-chip cache and TLB resources to those execution phases. Note that while some applications stay with a single cache and TLB configuration for most of their execution window, others demonstrate the need to adapt to the requirements of different phases in each program (see Table 4). Regardless, the dynamic schemes are able to determine the best cache and TLB configurations, which span the entire range of possibilities, for each application during execution. Note also that even though the inventors did not run the applications to completion, 3-4 application phases in which a different configuration was chosen were typically encountered during the execution of each of the eight programs.
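
As an illustration of the interval-based operation described above, the following sketch shows one plausible control loop: explore each candidate configuration for one interval, settle on the one with the lowest CPI, and remain there until the interval statistics suggest a phase change. All names, and the use of a bare CPI comparison, are illustrative assumptions rather than the exact patented mechanism.

    # Illustrative sketch only; the real hardware mechanism differs in detail.
    def interval_based_selection(configs, run_interval, phase_changed):
        """configs: candidate cache/TLB configurations.
        run_interval(c): executes one interval in configuration c, returns its CPI.
        phase_changed(): True when interval statistics (e.g., miss counts)
        deviate enough from the stable phase to warrant re-exploration."""
        while True:
            # Exploration: sample the CPI of every candidate configuration.
            cpis = {c: run_interval(c) for c in configs}
            best = min(cpis, key=cpis.get)
            # Stable phase: stay with the best configuration until the
            # statistics indicate that the application has changed phase.
            while not phase_changed():
                run_interval(best)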

The results for art and hydro2d demonstrate how the dynamic reconfiguration may in some cases degrade performance. Those applications are very unstable in their behavior and do not remain in any one phase for more than a few intervals. Art also does not fit in 2 MB, so there is no size which causes a sufficiently large drop in CPI to merit the cost of exploration. However, the dynamic scheme identifies that the application is spending more time exploring than in the stable state and turns exploration off altogether. Because that happens early enough in the case of art (whose simulation window is also much larger), art shows no overall performance degradation, while hydro2d has a slight 3% slowdown. That result illustrates that compiler analysis to identify such “unstable” applications and override the dynamic selection mechanism with a statically chosen cache configuration may be beneficial.
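
The instability check just described might be sketched as follows; the bookkeeping and the threshold value are hypothetical, but the rule matches the behavior described for art and hydro2d: when more intervals are spent exploring than running in a stable configuration, exploration is turned off.

    # Hypothetical bookkeeping for the "unstable application" check.
    class ExplorationController:
        INSTABILITY_FRACTION = 0.5  # assumed threshold, not from the patent

        def __init__(self):
            self.exploring_intervals = 0
            self.stable_intervals = 0
            self.exploration_enabled = True

        def end_interval(self, was_exploring):
            if was_exploring:
                self.exploring_intervals += 1
            else:
                self.stable_intervals += 1
            total = self.exploring_intervals + self.stable_intervals
            # More time exploring than stable: deem the application
            # unstable and turn exploration off altogether.
            if self.exploring_intervals > self.INSTABILITY_FRACTION * total:
                self.exploration_enabled = False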

Comparing the interval- and subroutine-based schemes shows that the simpler interval-based scheme usually outperforms the subroutine-based approach. The most notable exception is apsi, which has inconsistent behavior across intervals (as indicated by the large number of explorations in Table 4), causing it to thrash between a 256 KB L1 and a 768 KB L1. The subroutine-based scheme significantly improves performance relative to the interval-based approach, as each subroutine invocation within apsi exhibits consistent behavior from invocation to invocation. Yet, due to the overall results and the additional complexity of the subroutine-based scheme, the interval-based scheme appears to be the most practical choice and is the only scheme considered in the rest of the analysis.

In terms of the effect of TLB reconfiguration, health, swim, and compress benefit the most from using a larger TLB. Health and compress perform best with 256 and 128 entries, respectively, and the dynamic scheme settles at those sizes. Swim shows phase-change behavior with respect to TLB usage, resulting in five stable phases requiring either 256 or 512 TLB entries. A slight degradation in performance results from the configurable TLB in some of the benchmarks because the configurable TLB design is effectively a one-level hierarchy using a smaller number of total TLB entries, since data is not swapped between the primary and backup portions when handling TLB misses.
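
A minimal sketch of the TLB sizing rule implied above: grow the TLB when the miss rate exceeds a threshold, shrink it when usage falls below one. The candidate entry counts echo sizes mentioned in the text; the threshold values and single-step policy are assumptions.

    # Hypothetical TLB resizing rule; thresholds are placeholders.
    TLB_SIZES = [64, 128, 256, 512]   # candidate entry counts

    def next_tlb_size(size, miss_rate, usage,
                      miss_threshold=0.005, usage_threshold=0.5):
        i = TLB_SIZES.index(size)
        if miss_rate > miss_threshold and i < len(TLB_SIZES) - 1:
            return TLB_SIZES[i + 1]   # too many misses: add entries
        if usage < usage_threshold and i > 0:
            return TLB_SIZES[i - 1]   # most entries idle: shrink to save energy
        return size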

Those results demonstrate potential performance improvement for one technology point and microarchitecture. In order to determine the sensitivity of our qualitative results to different technology points and microarchitectural trade-offs, the processor pipeline speed was varied relative to the memory latencies (keeping the memory hierarchy latency fixed). The results in terms of performance improvement were similar for 1 (the base case), 1.5, and 2 GHz processors.

Energy-aware configuration results will now be set forth. The focus will be on the energy consumption of the on-chip memory hierarchy (including the energy to drive the off-chip bus). The memory energy per instruction (memory EPI, with each energy unit measured in nanojoules) results of FIG. 7 illustrate how, as is usually the case with performance optimizations, the cost of the performance improvement due to the configurable scheme is a significant increase in energy dissipation. That increase is caused by the fact that energy consumption is proportional to the associativity of the cache, and our configurable L1 uses larger set-associative caches. For that reason, the inventors explore how the energy-aware improvements may be used to provide a more modest performance improvement yet with a significant reduction in memory EPI relative to a pure performance approach.

FIG. 7 shows that merely selecting the energy-aware cache configurations (scheme F) has only a nominal impact on energy. In contrast, operating the L1 cache in a serial tag and data access mode (G) reduces memory EPI by 38% relative to the baseline interval-based scheme (D), bringing it in line with the best overall-performing conventional approach (B). For compress and swim, that approach even achieves roughly the same energy, with significantly better performance (see FIG. 8), as conventional configuration C, whose 64 KB two-way L1 data cache activates half as much cache capacity each cycle as the smallest L1 configuration (256 KB) of the configurable schemes. In addition, because the selection scheme automatically adjusts for the higher hit latency of serial access, that energy-aware configurable approach reduces memory CPI by 13% relative to the best-performing conventional scheme (B). Thus, the energy-aware approach may be used to provide more modest performance improvements in portable applications where design constraints such as battery life are of utmost importance. Furthermore, as with the dynamic voltage and frequency scaling approaches used today, that mode may be switched on under particular environmental conditions (e.g., when remaining battery life drops below a given threshold), thereby providing on-demand energy-efficient operation.
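
The energy effect of serial tag and data access can be illustrated with a toy per-access model: a parallel read of an A-way set-associative cache activates every tag and data way, whereas a serial read probes the tags first and then reads only the single matching data way. The per-array energies below are placeholders; only the structure of the comparison reflects the text.

    # Toy energy model; e_tag and e_way are placeholder per-array energies.
    def read_energy(assoc, e_tag, e_way, serial):
        if serial:
            # Serial mode (scheme G): all tag ways, then one data way.
            return assoc * e_tag + e_way
        # Parallel mode: all tag ways and all data ways every access.
        return assoc * (e_tag + e_way)

    # For a 4-way cache with e_tag = 0.1 and e_way = 1.0 (arbitrary units),
    # parallel access costs 4.4 while serial access costs 1.4, at the
    # price of a longer hit latency.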

To reduce energy, mechanisms such as serial tag and data access (as described above) have to be used. Since L2 and L3 caches are often already designed for serial tag and data access to save energy, reconfiguration at those lower levels of the hierarchy would not increase the energy consumed. Instead, it stands to decrease it by reducing the number of data transfers that need to be done between the various levels, i.e., by improving the efficiency of the memory hierarchy.

Thus, the energy benefits are investigated for providing a configurable L2/L3 cache hierarchy with a fixed L1 cache, as on-chip cache delays significantly increase with sub-0.1 μm geometries. Due to the prohibitively long latencies of large caches at those geometries, a three-level cache hierarchy becomes an attractive design option from a performance perspective. The inventors use the parameters from Agarwal et al., “Clock rate versus IPC: The end of the road for conventional microarchitectures,” Proceedings of the 27th International Symposium on Computer Architecture, pages 282-292, June 2000, for 0.035 μm technology to illustrate how dynamic L2/L3 cache configuration can match the performance of a conventional three-level hierarchy while dramatically reducing energy dissipation.

FIGS. 9 and 10 compare the performance and energy, respectively, of the conventional three-level cache hierarchy with the configurable scheme. Recall that TLB configuration was not attempted, so the improvements are completely attributable to the cache. Since the L1 cache organization has the largest impact on cache hierarchy performance, there is, as expected, little performance difference between the two, as each uses an identical conventional L1 cache. However, the ability of the dynamic scheme to adapt the L2/L3 configuration to the application results in a 43% reduction in memory EPI on average. The savings are caused by the ability of the dynamic scheme to use a larger L2 and thereby reduce the number of transfers between L2 and L3. Having only a two-level cache would, of course, eliminate those transfers altogether, but would be detrimental to program performance because of the large 60-cycle L2 access. Thus, in contrast to that approach of simply opting for a lower-energy, lower-performing solution (the two-level hierarchy), dynamic L2/L3 cache configuration can improve performance while dramatically improving energy efficiency.

The benchmarks were run with a perfect memory system (all data cache accesses serviced in one cycle) to estimate the contribution of the memory system to execution time. The difference in CPIs is referred to as the memory CPI. Since the dynamic cache is only trying to improve memory performance, the memory CPI quantifies the impact on memory performance, while CPI quantifies the impact on overall performance. In comparing energy consumption of the various configurations, the inventors use memory EPI (memory energy per instruction). To get an idea of overall performance across all benchmarks, the inventors use two metrics: the geometric mean (GM) of CPI speedups and the harmonic mean (HM) of IPCs, along with the corresponding values for the memory CPI. Likewise, the inventors use the GM of EPI speedups (energy of base case/energy of configuration) and the HM of instructions per joule.
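
The summary statistics named above can be computed as follows; the CPI values in the example are invented solely to exercise the functions:

    # Geometric mean of per-benchmark speedups and harmonic mean of rates.
    from math import prod

    def geometric_mean(xs):
        return prod(xs) ** (1.0 / len(xs))

    def harmonic_mean(xs):
        return len(xs) / sum(1.0 / x for x in xs)

    # Invented example: base-case and configured CPIs for three benchmarks.
    base_cpi = [2.0, 1.5, 3.0]
    conf_cpi = [1.6, 1.4, 2.1]
    gm_speedup = geometric_mean([b / c for b, c in zip(base_cpi, conf_cpi)])
    hm_ipc = harmonic_mean([1.0 / c for c in conf_cpi])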

The preferred embodiment thus provides a novel configurable cache and TLB as an alternative to conventional cache hierarchies. Repeater insertion is leveraged to enable dynamic cache and TLB configuration, with an organization that allows for dynamic speed/size tradeoffs while limiting the impact of speed changes to within the memory hierarchy. The configuration management algorithm is able to dynamically examine the tradeoff between an application's hit and miss intolerance using CPI as the ultimate metric to determine appropriate cache size and speed. At 0.1 μm technologies, our results show an average 15% reduction in CPI in comparison with the best conventional L1-L2 design of comparable total size, with the benefit almost equally attributable on average to the configurable cache and TLB. Furthermore, energy-aware enhancements to the algorithm trade off a more modest performance improvement for a significant reduction in energy. Projecting to 0.035 μm technologies and a three-level cache hierarchy, improved performance can be shown with an average 43% reduction in memory hierarchy energy when compared to a conventional design. That latter result demonstrates that because the configurable approach significantly improves memory hierarchy efficiency, it can serve as a partial solution to the significant power dissipation challenges facing future processor architects.

While a preferred embodiment of the present invention and various modifications thereof have been set forth in detail, those skilled in the art will readily appreciate that other embodiments can be realized within the scope of the invention. For example, recitations of specific hardware or software should be construed as illustrative rather than limiting. The same is true of specific interval times, thresholds, and the like. Therefore, the present invention should be construed as limited only by the appended claims.

CLAIMS

1. A method of reconfiguring a data cache for caching data in a computing device, the data cache operating at a plurality of levels in a memory hierarchy and comprising a portion having a variable size operating at a first level of the plurality of levels, the method comprising: (a) storing performance information for the data cache; (b) determining, from the performance information, whether the data cache has a miss rate exceeding a threshold; (c) determining whether the variable size is equal to a maximum size; and (d) if the miss rate exceeds the threshold and the variable size is not equal to the maximum size, controlling the data cache to increase the variable size.
2. The method of claim 1, further comprising: (e) if the miss rate does not exceed the threshold or the variable size is equal to the maximum size, (i) determining, from the performance information, an optimal data cache configuration which optimizes a number of cycles per instruction in the computing device and (ii) setting the data cache to the optimal data cache configuration.
3. The method of claim 2, wherein, in each of a plurality of time periods during which the data cache operates, steps (a)-(c) and one of steps (d) and (e) are performed.
4. The method of claim 3, wherein each of the time periods is a fixed number of cycles of the computing device.
5. The method of claim 3, wherein each of the time periods is a time period in which the computing device performs a subroutine.
6. The method of claim 3, wherein: the data cache is designated as either stable or unstable; and steps (a)-(c) are performed only during intervals in which the data cache is designated as unstable.

7. The method of claim 6, further comprising, during intervals in which the data cache is designated as stable: (f) determining, from the performance information, whether the data cache is actually unstable; and (g) if the data cache is actually unstable, (i) designating the data cache as unstable and (ii) setting the variable size to a minimum value.

8. The method of claim 7, wherein: the performance information comprises a hit counter for a second portion of the data cache which is outside the portion having the variable size; and when the data cache is designated as stable and the hit counter is below a hit counter threshold, the second portion of the data cache is bypassed.
9. The method of claim 1, wherein: the data cache comprises tag arrays and data arrays; the first level is L1; and in the portion having the variable size, the tag arrays and the data arrays are read in series.
10. A method of reconfiguring a translation look-aside buffer for use in a computing device, the translation look-aside buffer having a variable size, the method comprising: (a) storing performance information for the translation look-aside buffer; (b) determining, from the performance information, whether the translation look-aside buffer has a miss rate exceeding a first threshold; (c) determining, from the performance information, whether the translation look-aside buffer has a usage less than a second threshold; (d) if the miss rate exceeds the first threshold, controlling the translation look-aside buffer to increase the variable size; and (e) if the usage is less than the second threshold, controlling the translation look-aside buffer to decrease the variable size.
11. The method of claim 10, wherein, in each of a plurality of time periods during which the translation look-aside buffer operates, steps (a)-(c) and one of steps (d) and (e) are performed.
12. The method of claim 11, wherein each of the time periods is a fixed number of cycles of the computing device.
13. A method for configuring a cache, comprising: storing performance information for a data cache having at least one portion with a variable size, wherein the data cache is configured to operate at a plurality of levels in a memory hierarchy; determining, from the performance information, whether a miss rate for the data cache exceeds a threshold; and if the miss rate exceeds the threshold, increasing the variable size.
14. The method of claim 13, further comprising: determining whether the variable size is equal to a maximum size; and increasing the variable size if the variable size is determined to be less than a maximum size.
15. The method of claim 14, further comprising not increasing the variable size if the variable size is determined to be at least the maximum size.
16. The method of claim 13, further comprising: if the miss rate does not exceed the threshold or the variable size is equal to the maximum size, determining, from the performance information, an optimal data cache configuration which optimizes a number of cycles per instruction in the computing device; and setting the data cache to the optimal data cache configuration.

17. A non-transitory tangible computer-readable medium having instructions stored thereon, the instructions comprising: instructions to store performance information for a data cache having at least a portion thereof with a variable size, wherein the data cache is configured to operate at a plurality of levels in a memory hierarchy; instructions to determine, from the performance information, whether a miss rate for the data cache exceeds a threshold; and instructions to increase the variable size in response to the miss rate exceeding the threshold.

18. The non-transitory tangible computer-readable medium of claim 17, further comprising: instructions to determine whether the variable size is equal to a maximum size; and instructions to increase the variable size if the variable size is determined to be less than a maximum size.

19. The non-transitory tangible computer-readable medium of claim 18, further comprising instructions to not increase the variable size if the variable size is determined to be at least the maximum size.
20. The non-transitory tangible computer-readable medium of claim 17, further comprising: if the miss rate does not exceed the threshold or the variable size is equal to the maximum size, instructions to determine, from the performance information, an optimal data cache configuration which optimizes a number of cycles per instruction in the computing device; and instructions to set the data cache to the optimal data cache configuration.
21. A method, comprising: storing performance information for a translation look-aside buffer having a variable size; determining from the performance information whether a miss rate for the translation look-aside buffer exceeds a first threshold; and if the miss rate exceeds the first threshold, increasing the variable size.
22. The method of claim 21, further comprising: determining from the performance information whether the translation look-aside buffer has a usage less than a second threshold; and if the usage is less than the second threshold, controlling the translation look-aside buffer to decrease the variable size.
23. A non-transitory machine readable medium having stored thereon instructions that, if executed by a processor, result in a method comprising: storing performance information for a translation look-aside buffer having a variable size; determining from the performance information whether a miss rate for the translation look-aside buffer exceeds a first threshold; and if the miss rate exceeds the first threshold, increasing the variable size.
24. The non-transitory machine readable medium of claim 23, further comprising: determining from the performance information whether the translation look-aside buffer has a usage less than a second threshold; and if the usage is less than the second threshold, controlling the translation look-aside buffer to decrease the variable size.